CHAPTER 1
AN INTRODUCTION TO COMPILED BASIC
This chapter explores the internal workings of the BASIC compiler. Many
people view a compiler simply as a "black box" which magically transforms
BASIC source files into executable code. Of course, magic does not play
a part in any computer program, and the BC compiler that comes with
Microsoft BASIC is no exception. It is merely a program that processes
data in the same way any other program would. In this case, the data is
your BASIC source code.
You will learn here what the BASIC compiler does, and how it does it.
You will also get an inside glimpse at some of the decisions a compiler
must make, as it transforms your code into the assembly language commands
the CPU will execute. By truly understanding the compiler's role, you will
be able to exploit its strengths and also avoid its weaknesses.
COMPILER FUNDAMENTALS
=====================
No matter what language a program is written in, at some point it must be
translated into the binary codes that the PC's processor can understand.
Unlike a BASIC program, which is built from powerful commands, the CPU within
every PC is capable of acting on only very rudimentary instructions. Some
typical examples of these instructions
are "Add 3 to the value stored in memory location 100", and "Compare the
value stored at address 4012 to the number -12 and jump to the code at
address 2015 if it is less". Therefore, one very important value of a
high-level language such as BASIC is that a programmer can use meaningful
names instead of memory addresses when referring to variables and
subroutines. Another is the ability to perform complex actions that
require many separate small steps using only one or two statements.
As an example, when you use the command PRINT X% in a program, the value
of X% must first be converted from its native two-byte binary format into
an ASCII string suitable for display. Next, the current cursor location
must be determined, at which point the characters in the string are placed
into the screen's memory area. Further, the cursor position has to be
updated, to place it just past the digits that were printed. Finally, if
the last digit happened to end up at the bottom-right corner of the screen,
the display must also be scrolled up a line. As you can see, that's an
awful lot of activity for such a seemingly simple statement!
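To make those steps concrete, here is a rough model of them in Python. It is only a sketch of the sequence just described, not BASIC's actual print routine. The 25 by 80 screen list and the names new_screen and print_int are assumptions made for illustration. The spaces BASIC prints around numbers, and the move to a new line that a plain PRINT performs, are left out to keep it short.

```python
ROWS, COLS = 25, 80   # assumed text-mode screen dimensions

def new_screen():
    """Return a blank screen: a list of rows, each a list of characters."""
    return [[" "] * COLS for _ in range(ROWS)]

def print_int(screen, cursor, value):
    """Write an integer at the cursor and return the updated (row, col)."""
    row, col = cursor
    digits = str(value)               # convert the binary value to ASCII text
    for ch in digits:
        screen[row][col] = ch         # place the character into screen memory
        col += 1                      # move the cursor just past it
        if col == COLS:               # ran off the right edge of the row
            col = 0
            if row == ROWS - 1:       # already on the bottom row: scroll up
                del screen[0]
                screen.append([" "] * COLS)
            else:
                row += 1
    return row, col

screen = new_screen()
print(print_int(screen, (0, 0), -45))      # cursor ends just past the digits
print(print_int(screen, (24, 77), 123))    # last digit lands in the corner
```

The second call is the bottom-right corner case: the digits are written, the screen scrolls up one line, and the cursor ends up at the start of the new bottom row.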
A compiler, then, is a program that translates these English-like BASIC
source statements into the many separate and tiny steps the microprocessor
requires. The BASIC compiler has four major responsibilities, as shown in
Figure 1-1 below.
1. Translate BASIC statements into an equivalent series of assembly
language commands.
2. Assign addresses in memory to hold each of the variables being used
by the program.
3. Remember the addresses in the generated code where each line number
or label occurs, for GOTO and GOSUB statements.
4. Generate additional code to test for events and detect errors when
the /v, /w, or /d compile options are used.
Figure 1-1: The primary actions performed by a BASIC compiler.
As the compiler processes a program's source code, it translates only the
most basic statements directly into assembly language. For other, more
complex statements, it instead generates calls to routines in the BASIC
run-time library that is supplied with your compiler. When designing a
BASIC program you would most likely identify operations that need to be
performed more than once, and then create subprograms or functions rather
than add the same code in-line repeatedly. Likewise, the compiler takes
advantage of the inherent efficiency of using called subroutines.
For example, when you use a BASIC statement such as PRINT Work$, the
compiler processes it as if you had used CALL PRINT(Work$). That is, PRINT
really is a called subroutine. Similarly, when you write OPEN FileName$
FOR RANDOM AS #1 LEN = 1024, the compiler treats that as a call to its Open
routine, and it creates code identical to CALL OPEN(FileName$, 1, 1024, 4).
Here, the first argument is the file name, the second is the file number
you specified, the third is the record length, and the value 4 is BASIC's
internal code for RANDOM. Because these are BASIC keywords, the CALL
statement is of course not required. But the end result is identical.
While the BC compiler could certainly create code to print the string
or open the file directly, that would be much less efficient than using
subroutines. Indeed, all of the subroutines in the Microsoft-supplied
libraries are written in assembly language for the smallest size and
highest possible performance.
DATA STORAGE
The second important job the compiler must perform is to identify all of
the variables and other data your program is using, and allocate space for
them in the object file. There are two kinds of data that are manipulated
in a BASIC program--static data and dynamic data. The term static data
refers to any variable whose address and size do not change during the
execution of a program. This includes all simple numeric and TYPE variables,
and static numeric and TYPE arrays. String constants such as "Press a key
to continue" and DATA items are also considered to be static data, since
their contents never change.
Dynamic data is that which changes in size or location when the program
runs. One example of dynamic data is a dynamic array, because space to
hold its contents is allocated when the program runs. Another is string
data, which is constantly moved around in memory as new strings are
assigned and old ones are erased. Variable and array storage is discussed
in depth in Chapter 2, so I won't belabor that now. The goal here is
simply to introduce the concept of variable storage. The important point
is that BC deals only with static data, because that must be placed into
the object file.
As the compiler processes your source code, it must remember each
variable that is encountered, and allocate space in the object file to hold
it. Further, all of this data must be able to fit into a single 64K
segment, which is called DGROUP (for Data Group). Although the compiled
code in each object file may be as large as 64K, static data is combined
from all of the files in a multi-module program, and may not exceed 64K in
total size. Note that this limitation is inherent in the design of the
Intel microprocessors, and has nothing to do with BC, LINK, or DOS.
As each new variable is encountered, room to hold it is placed into the
next available data address in the object file. (In truth, the compiler
retains all variable information in memory, and writes it to the end of the
file all at once following the generated code.) For each integer variable,
two bytes are set aside. Long integer and single precision variables
require four bytes each, while double precision variables occupy eight
bytes. Fixed-length string and TYPE variables use a varying number of
bytes, depending on the components you have defined.
Static numeric and TYPE arrays are also written to the object file by
the compiler. The number of bytes that are written of course depends on
how many elements have been specified in the DIM statement. Also, notice
that no matter what type of variable or array is encountered, only zeroes
are written to the file. The only exceptions are quoted string constants
and DATA items, in which case the actual text must be stored.
Strings are handled somewhat differently from numeric, TYPE, and
fixed-length variables. For each string variable a program uses, a
four-byte table called a *string descriptor* is placed into the object
file. However, since the actual string data is not assigned until the
program is run, space for that data need not be handled by the compiler.
With string arrays--whether static or dynamic--a table of four-byte
descriptors is allocated.
Finally, each array in the program also requires an array descriptor.
This is simply a table that shows where the array's data is located in
memory, how many elements it currently holds, the length in bytes of each
element, and so forth.
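The bookkeeping described in this section can be sketched in a few lines of Python. This is a simplified model and not BC's real allocator. The names assign_addresses and static_array_bytes, and the starting offset of zero, are assumptions. Fixed-length strings, TYPE variables, and array descriptors are not modeled. The byte sizes are the ones given above, with only the four-byte descriptor counted for each string.

```python
# Bytes reserved per variable, keyed by BASIC type suffix. For "$" only the
# four-byte string descriptor is counted, not the string data itself.
SIZES = {"%": 2, "&": 4, "!": 4, "#": 8, "$": 4}
DGROUP_LIMIT = 64 * 1024   # all static data must fit in one 64K segment

def assign_addresses(names, start=0):
    """Give each new variable the next free offset.

    Returns ({name: offset}, next_free_offset).
    """
    table = {}
    offset = start
    for name in names:
        if name in table:
            continue                  # seen before: it keeps its address
        size = SIZES[name[-1]]
        if offset + size > DGROUP_LIMIT:
            raise MemoryError("static data exceeds the 64K data segment")
        table[name] = offset
        offset += size
    return table, offset

def static_array_bytes(lower, upper, suffix):
    """Bytes needed for a static array dimensioned as (lower TO upper)."""
    return (upper - lower + 1) * SIZES[suffix]

table, next_free = assign_addresses(
    ["A%", "Total&", "Rate!", "Sum#", "Work$", "A%"])
print(table, next_free)
print(static_array_bytes(1, 100, "%"))   # DIM Array%(1 TO 100)
```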
ASSEMBLY LANGUAGE CONSIDERATIONS
In order to fully appreciate how the translation process operates, you will
first need to understand what assembly language is all about. Please
understand that there is nothing inherently difficult about assembly
language. Like BASIC, assembly language is composed of individual
instructions that are executed in sequence. However, each of these
instructions does much less than a typical BASIC statement. Therefore,
many more steps are required to achieve a given result than in a high-level
language. Some of these steps will be shown in the following examples.
If you are not comfortable with the idea of tackling assembly language
concepts just yet, please feel free to come back to this section at a later
time.
Let's begin by examining some very simple BASIC statements, and see how
they are translated by the compiler. For simplicity, I will show only
integer math operations. The 80x86 family of microprocessors can
manipulate integer values directly, as opposed to single and double
precision numbers which are much more complex. The short code fragment in
Listing 1-1 shows some very simple BASIC instructions, along with the
resulting compiled assembly code. In case you are interested,
disassemblies such as those you are about to see are easy to create for
yourself using the Microsoft CodeView utility. CodeView is included with
the Macro Assembler as well as with BASIC PDS.
A% = 12
  MOV  WORD PTR [A%],12     ;move a 12 into the word variable A%

X% = X% + 1
  INC  WORD PTR [X%]        ;add 1 to the word variable X%

Y% = Y% + 100
  ADD  WORD PTR [Y%],100    ;add 100 to the word variable Y%

Z% = A% + B%
  MOV  AX,WORD PTR [B%]     ;move the contents of B% into AX
  ADD  AX,WORD PTR [A%]     ;add to that the value of A%
  MOV  WORD PTR [Z%],AX     ;move the result into Z%
Listing 1-1: These short examples show the compiled results of some simple
BASIC math operations.
The first statement, A% = 12, is directly translated to its assembler
equivalent. Here, the value 12 is *moved* into the word-sized address
named A%. Although an integer is the smallest data type supported by
BASIC, the microprocessor can in fact deal with variables as small as one
byte. Therefore, the WORD PTR (word pointer) argument is needed to specify
that A% is a full two-byte integer, rather than a single byte. Notice that
in assembly language, brackets are used to specify the contents of a memory
address. This is not unlike BASIC's PEEK() function, where parentheses are
used for that purpose.
In the second statement, X% = X% + 1, the compiler generates assembly
language code to increment, or add 1 to, the word-sized variable in the
location named X%. Since adding or subtracting a value of 1 is such a
common operation in all programming languages, the designers of the 80x86
included the INC (and complementary DEC) instruction to handle that.
Y% = Y% + 100 is similarly translated, but in this case to assembler
code that adds the value 100 to the word-sized variable at address Y%. As
you can see, the simple BASIC statements shown thus far have a direct
assembly language equivalent. Therefore, the code that BC creates is
extremely efficient, and in fact could not be improved upon even by a human
hand-coding those statements in assembly language.
The last statement, Z% = A% + B%, is only slightly more complicated than
the others. This is because separate steps are required to retrieve the
contents of one memory location, before manipulating it and assigning the
result to another location. Here, the value held in variable B% is moved
into one of the processor's registers (AX). The value of variable A% is
then added to AX, and finally the result is moved into Z%. There are about
a dozen registers within the CPU, and you can think of them as special
variables that can be accessed very quickly.
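If it helps to see those instructions as ordinary operations, the following Python sketch mimics the effect of MOV, ADD, and INC on a handful of named memory locations and one register. It is a toy model only. The starting values are arbitrary, the helper names are my own, and 16-bit wraparound is ignored.

```python
memory = {"A%": 0, "B%": 7, "X%": 41, "Y%": 5, "Z%": 0}   # arbitrary values
regs = {"AX": 0}                                          # one CPU register

def value(operand):
    """Fetch a constant, a register, or the contents of a memory location."""
    if isinstance(operand, int):
        return operand
    if operand in regs:
        return regs[operand]
    return memory[operand]

def store(dest, new_value):
    """Write to a register or to a memory location."""
    if dest in regs:
        regs[dest] = new_value
    else:
        memory[dest] = new_value

def mov(dest, src):
    store(dest, value(src))

def add(dest, src):
    store(dest, value(dest) + value(src))

def inc(dest):
    store(dest, value(dest) + 1)

mov("A%", 12)        # A% = 12
inc("X%")            # X% = X% + 1
add("Y%", 100)       # Y% = Y% + 100
mov("AX", "B%")      # Z% = A% + B% takes three steps through AX
add("AX", "A%")
mov("Z%", "AX")
print(memory, regs)
```

Only the last statement needs the register, because the result has to be built from two memory locations before it can be stored in a third.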
The next example in Listing 1-2 shows how BASIC passes arguments to its
internal routines, in this case PRINT and OPEN. Whenever a variable is
passed to a routine, what is actually sent is the address (memory location)
of the variable. This way, the routine can go to that address, and read
the value that is stored there. As in Listing 1-1, the BASIC source code
is shown along with the resultant compiler-generated assembler
instructions.
It may also be worth mentioning that the order in which the arguments
are sent to these routines is determined by how the routines are designed.
In BASIC, if a SUB is designed to accept, say, three parameters in a
certain order, then the caller must pass its arguments in that same order.
Parameters in assembler routines are handled in exactly the same manner.
Of course, any arbitrary order could be used, and what's important is
simply that they match.
PRINT Work$
  MOV  AX,OFFSET Work$      ;move the address of Work$ into AX
  PUSH AX                   ;push that onto the CPU stack
  CALL B$PESD               ;call the string printing routine

OPEN FileName$ FOR OUTPUT AS #1
  MOV  AX,OFFSET FileName$  ;load the address of FileName$
  PUSH AX                   ;push that onto the stack
  MOV  AX,1                 ;load the specified file number
  PUSH AX                   ;and push that as well
  MOV  AX,-1                ;-1 means that a LEN= was not given
  PUSH AX                   ;and push that
  MOV  AX,2                 ;2 is the internal code for OUTPUT
  PUSH AX                   ;pass that on too
  CALL B$OPEN               ;finally, call the OPEN routine
Listing 1-2: Many BASIC statements create assembler code that passes
arguments to internal routines, as shown above.
When you tell BASIC to print a string, it first loads the address of the
string into AX, and then pushes that onto the stack. The stack is a
special area in memory that all programs can access, and it is often used
in compiled languages to hold the arguments being sent to subroutines. In
this case, the OFFSET operator tells the CPU to obtain the address where
the variable resides, as opposed to the current contents of the variable.
Notice that the words offset, address, and memory location all mean the
same thing. Also notice that calls in assembly language work exactly the
same as calls in BASIC. When the called routine has finished, execution
in the main program resumes with the next statement in sequence.
Once the address for Work$ has been pushed, BASIC's B$PESD routine is
called. Internally, one of the first things that B$PESD does is to
retrieve the incoming address from the stack. This way it can locate the
characters that are to be printed. B$PESD is responsible for printing
strings, and other BASIC library routines are provided to print each type
of data such as integers and single precision values.
In case you are interested, PESD stands for Print End-of-line String
Descriptor. Had a semicolon been used in the print statement--that is,
PRINT Work$;--then B$PSSD would have been called instead (Print Semicolon
String Descriptor). Likewise, printing a 4-byte long integer with a
trailing comma as in PRINT Value&, would result in a call to B$PCI4 (Print
Comma Integer 4), where the 4 indicates the integer's size in bytes.
In the second example of Listing 1-2 the OPEN routine is set up and
called in a similar fashion, except that four parameters are required
instead of only one. Again, each parameter is pushed onto the stack in
turn, followed by a call to the routine. Most of BASIC's internal routines
begin with the characters "B$", to avoid a conflict with subroutines of
your own. Since a dollar sign is illegal in a BASIC procedure name, there
is no chance that you will inadvertently choose one of the same names that
BASIC uses.
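The push-then-call pattern can be imitated with a Python list standing in for the CPU stack. Treat this as a sketch of the idea rather than a description of what B$OPEN really does internally. The function names, the returned dictionary, and the address 0036 hex are hypothetical. The return address that a real CALL also places on the stack is ignored. The mode codes and the -1 for a missing LEN= are the ones given in this chapter.

```python
MODE_CODES = {"RANDOM": 4, "OUTPUT": 2}   # the two codes given in this chapter

def b_open(stack):
    """Callee: fetch the four arguments, last pushed first."""
    mode = stack.pop()
    record_len = stack.pop()
    file_number = stack.pop()
    name_address = stack.pop()
    return {"name_address": name_address, "file_number": file_number,
            "record_len": record_len, "mode": mode}

def open_file(stack, name_address, file_number, mode, record_len=None):
    """Caller: push each argument in the agreed order, then call the routine."""
    stack.append(name_address)                              # address of name
    stack.append(file_number)                               # file number
    stack.append(-1 if record_len is None else record_len)  # -1: no LEN=
    stack.append(MODE_CODES[mode])                          # internal code
    return b_open(stack)

stack = []
FILE_NAME_ADDRESS = 0x0036     # hypothetical address of FileName$
print(open_file(stack, FILE_NAME_ADDRESS, 1, "OUTPUT"))
print(open_file(stack, FILE_NAME_ADDRESS, 1, "RANDOM", 1024))
```

Because both sides agree on the order, the callee can recover each value purely by position. Nothing on the stack says which argument is which.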
As you can see, there is nothing mysterious or even difficult about
assembly language, or the translations performed by the BASIC compiler.
However, a sequence of many small steps is often needed to perform even
simple calculations and assignments. We will discuss assembly language in
much greater depth in Chapter 14, and my purpose here is merely to present
the underlying concepts.
Please note that variable names are not retained after a program has
been compiled. Once BC has finished its job, all references to each
variable name have been replaced with an equivalent memory address in the
object file. Further, once LINK has joined the object files and linked
them to the BASIC language libraries, the procedure names are lost as well.
These issues will be explored in much greater detail in Chapter 14.
COMPILER DIRECTIVES
As you have seen, some code is translated by the compiler into the
equivalent assembly language statements, while other code is instead
converted to calls to the language routines in the BASIC libraries. Some
statements, however, are not translated at all. Rather, they are known as
*compiler directives* that merely provide information to the compiler as
it works. Some examples of these non-executable BASIC statements include
DEFINT, OPTION BASE, and REM, as well as the various "metacommands" such
as '$INCLUDE and '$DYNAMIC. Some others are SHARED, BYVAL, DATA, DECLARE,
CONST, and TYPE.
For our purposes here, it is important to understand that DIM when used
on a static array is also a non-executable statement. Because the size of
the array is known when the program is compiled, BC can simply set aside
memory in the object file to hold the array contents. Therefore, code does
not need to be generated to actually create the array. Similarly, TYPE/END
TYPE statements also merely define a given number of bytes that will
ultimately end up in the program file when the TYPE variable is later
dimensioned by your program.
EVENT AND ERROR CHECKING
The last compiler responsibility I will discuss here is the generation of
additional code to test for events and debugging errors. This occurs
whenever a program is compiled using the /d, /w, or /v command line
switches. Although event trapping and debugging are entirely separate
issues, they are handled in a similar manner. Let's start with event
trapping.
When the IBM PC was first introduced, the ability to handle interrupt-
driven events distinguished it from its then-current Apple and Commodore
counterparts. Interrupts can provide an enormous advantage over polling
methods, since polling requires a program to check constantly for, say,
keyboard or communications activity. With polling, a program must
periodically examine the keyboard using INKEY$, to determine if a key was
pressed. But when interrupts are used, the program can simply go about its
business, confident that any keystrokes will be processed. Here's how that
works:
Each time a key is pressed on a PC, the keyboard generates a hardware
interrupt that suspends whatever is currently happening and then calls a
routine in the ROM BIOS. That routine in turn reads the character from the
keyboard's output port, places it into the PC's keyboard buffer, and
returns to the interrupted application. The next time a program looks for
a keystroke, that key is already waiting to be read. For example, a
program could begin writing a huge multi-megabyte disk file, and any
keystrokes will still be handled even if the operator continues to type.
Understand that hardware interrupts are made possible by a direct
physical connection between the keyboard circuitry and the PC's
microprocessor. The use of interrupts is a powerful concept, and one which
is important to understand. Unfortunately, BASIC does not use interrupts
in most cases, and this discussion is presented solely in the interest of
completeness.
Event Trapping
BASIC provides a number of event handling statements that perhaps *could*
be handled via interrupts, but aren't. When you use ON TIMER, for example,
code is added to periodically call a central event handler to check if the
number of seconds specified has elapsed. Because there are so many
possible event traps that could be active at one time, it would be
unreasonable to expect BASIC to set up separate interrupts to handle each
possibility. In some situations, such as ON KEY, there is a corresponding
interrupt, in this case the keyboard interrupt. However, some events
such as ON PLAY(Count), where a GOSUB is made whenever the PLAY buffer has
fewer than Count characters remaining, have no corresponding physical
interrupt. Therefore, polling for that condition is the only reasonable
method.
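A polled timer trap of this kind might look like the Python sketch below. The class and function names are invented for illustration, and the clock readings are simply passed in as numbers so the example is repeatable. BASIC's real event handler is certainly more involved.

```python
class TimerTrap:
    """State for one ON TIMER(n) GOSUB trap."""
    def __init__(self, interval, handler, now):
        self.interval = interval    # seconds between events
        self.handler = handler      # stands in for the GOSUB target
        self.last = now             # time the trap was armed or last fired
        self.enabled = True         # TIMER ON

def event_check(trap, now):
    """Stand-in for the central handler: fire the trap if enough time passed."""
    if trap.enabled and now - trap.last >= trap.interval:
        trap.last = now
        trap.handler()

beeps = []
trap = TimerTrap(1, lambda: beeps.append("BEEP"), now=0.0)
for clock in (0.2, 0.6, 1.1, 1.5, 2.3):   # one check per executed statement
    event_check(trap, clock)
print(beeps)
```

Notice that the handler runs only when a check happens to be made. If the program makes no checks for several seconds, the event is simply noticed late.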
The example in Listing 1-3 shows what happens when you compile using the
/v switch. Notice that the calls to B$EVCK (Event Check) are not part of
the original source code. Rather, they show the additional code that BC
places just before each program statement.
DEFINT A-Z
CALL B$EVCK 'this call is generated by BC
ON TIMER(1) GOSUB HandleTime
CALL B$EVCK 'this call is generated by BC
TIMER ON
CALL B$EVCK 'this call is generated by BC
X = 10
CALL B$EVCK 'this call is generated by BC
Y = 100
CALL B$EVCK 'this call is generated by BC
END
HandleTime:
CALL B$EVCK 'this call is generated by BC
BEEP
CALL B$EVCK 'this call is generated by BC
RETURN
Listing 1-3: When the /v compiler switch is used, BC generates calls to a
central event handler at each BASIC statement.
At five bytes per call, you can see that using /v can quickly bloat a
program to an unacceptable size. One alternative is to instead use /w.
In fact, /w can be particularly attractive in those cases where event
handling cannot be avoided, because it lets you specify where a call to
B$EVCK is made: at each line label or line number in your source code. The
only downside to using line numbers and labels is that additional working
memory is needed by BC to remember the addresses in the code where those
labels are placed. This is not usually a problem, though, unless the
program is very large or every line is labeled.
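To get a feel for the difference in size, the Python sketch below counts the calls that would be added to the executable lines of Listing 1-3 under each switch, at five bytes per call. The helper names are hypothetical, and the tests for what counts as a label or a statement are deliberately crude.

```python
CALL_BYTES = 5   # size of each generated call, as stated above

# The executable part of Listing 1-3 (DEFINT is a directive, so it is omitted).
PROGRAM = [
    "ON TIMER(1) GOSUB HandleTime",
    "TIMER ON",
    "X = 10",
    "Y = 100",
    "END",
    "HandleTime:",
    "BEEP",
    "RETURN",
]

def has_label(line):
    """True if the line starts with a 'Name:' label or a line number."""
    first = line.split()[0]
    return first.endswith(":") or first.isdigit()

def has_statement(line):
    """True unless the line is nothing but a label."""
    return not line.strip().endswith(":")

def event_check_bytes(lines, switch):
    """Bytes of event-check calls added under /v (every statement) or /w (labels)."""
    if switch == "/v":
        calls = sum(1 for line in lines if has_statement(line))
    elif switch == "/w":
        calls = sum(1 for line in lines if has_label(line))
    else:
        calls = 0
    return calls * CALL_BYTES

print(event_check_bytes(PROGRAM, "/v"), event_check_bytes(PROGRAM, "/w"))
```

For this tiny program the seven calls generated by /v shrink to a single call under /w, placed at the HandleTime label.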
All of the various BASIC event handling commands are specified using the
ON statement. It is important to understand, however, that ON GOTO and ON
GOSUB do not involve events. That is, they are really just an alternate
form of GOTO and GOSUB respectively, and thus do not require compiling with
/w or /v.
Error Trapping
The last compiler option to consider here is the /d switch, because it too
generates extra code that you might not otherwise be aware of. When a
program is compiled with /d, two things are added. First, for every BASIC
statement a call is made to a routine named B$LINA, which merely checks to
see if Ctrl-Break has been pressed. Normally, a compiled BASIC program is
immune to pressing the Ctrl-C and Ctrl-Break keys, except during an INPUT
or LINE INPUT statement. Since much of the purpose of a debugging mode is
to let you break out of an errant program gone berserk, the Ctrl-Break
checking must be performed frequently. These checks are handled in much
the same way as event trapping, by calling a special routine once for each
line in your source code.
Another important factor resulting from the use of /d is that all array
references are handled through a special called routine which ensures that
the element number specified is in fact legal. Many people don't realize
this, but when a program is compiled without /d and an invalid element is
given, BASIC will blindly write to the wrong memory locations. For
example, if you use DIM Array%(1 TO 100) and then attempt to assign, say,
element number 200, BASIC is glad to oblige. Of course, there *is* no
element 200 in that case, and some other data will no doubt be overwritten
in the process.
To prevent these errors from going undetected, BC calls the B$HARY (Huge
Array) routine to calculate the address based on the element number
specified. If B$HARY determines that the array reference is out of bounds,
it invokes an internal error handler and you receive the familiar
"Subscript out of range" message. Normally, the compiler accesses array
elements using as little code as possible, to achieve the highest possible
performance. If a static array is dimensioned to 100 elements and you
assign element 10, BC knows at the time it compiles your program the
address at which that element resides. It can therefore access that
element directly, just as if it were a non-array variable.
Even when you use a variable to specify an array element such as
Array%(X) = 12, the starting address of the array is known, and the value
in X can be used to quickly calculate how far into the array that element
is located. Therefore, the lack of bounds checking in programs that do not
use /d is not a bug in BASIC. Rather, it is merely a trade-off to obtain
very high performance. Indeed, one of the primary purposes of using /d is
to let BC find mistakes in your programs during development, though at the
cost of execution speed.
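The address arithmetic involved is simple enough to show directly. In this Python sketch the base address of 1000 is hypothetical and the function names are my own. The checked version only stands in for the kind of test made when /d is used. It is not the actual B$HARY code.

```python
def element_offset(base, lower, index, elem_size):
    """Unchecked address calculation: fast, but trusts the index blindly."""
    return base + (index - lower) * elem_size

def checked_offset(base, lower, upper, index, elem_size):
    """The same calculation, preceded by the bounds test that /d adds."""
    if not lower <= index <= upper:
        raise IndexError("Subscript out of range")
    return element_offset(base, lower, index, elem_size)

BASE = 1000   # hypothetical address where Array%(1 TO 100) begins

print(element_offset(BASE, 1, 10, 2))     # a legal element
print(element_offset(BASE, 1, 200, 2))    # "element 200": past the array
try:
    checked_offset(BASE, 1, 100, 200, 2)
except IndexError as err:
    print(err)
```

The array occupies 200 bytes starting at the base address, so the unchecked result for element 200 lands well beyond its end, in memory that belongs to something else.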
The biggest complication from BASIC's point of view is when huge
(greater than 64K) arrays are being manipulated. In fact, B$HARY is the
very same routine that BC calls when you use the /ah switch to specify huge
arrays (hence the name HARY). Since extra code is needed to set up and
call B$HARY compared to the normal array access, using /ah also creates
programs that are larger and slower than when it is not used. Further,
because B$HARY is used by both /d and /ah, invalid element accesses will
also be trapped when you compile using /ah.
Overflow Errors
The final result of using /d is that extra code is generated after certain
math operations, to check for overflow errors that might otherwise go
undetected. Overflow errors are those that result in a value too large for
a given data type. For example, if you multiply two integers and the result
exceeds 32767, that causes an overflow error. Similarly, an underflow error
would be created by a calculation resulting in a value that is too small.
When a floating point math operation is performed, errors that result
from overflow are detected by the routines that perform the calculation.
When that happens there is no recourse other than halting your program with
an appropriate message. Integer operations, however, are handled directly
by 80x86 instructions. Further, an out of bounds result is not necessarily
illegal to the CPU. Thus, programs compiled without the /d option can
produce erroneous results, and without any indication that an error
occurred.
To prove this to yourself, compile and run the short program shown in
Listing 1-4, but without using /d. Although the correct result should be
90000, the answer that is actually displayed is 24464. And you will notice
that no error message is displayed!
As with illegal array references, BC would rather optimize for speed, and
give you the option of using /d as an aid for tracking down such errors as
they occur. If you compile the program in Listing 1-4 with the /d option,
then BASIC will report the error as expected.
Since an overflow resulting from integer operations is not technically
an error as far as the CPU is concerned, how, then, can BASIC trap for
that? Although an error in the usual sense is not created, there is a
special flag variable within the CPU that is set whenever such a condition
occurs. Further, a little-used assembler instruction, INTO (Interrupt 4
if Overflow), will generate software Interrupt 4 if that flag is set.
Therefore, all BC has to do is create an Interrupt 4 handler, and then
place an INTO instruction after every integer math operation in the
compiled code. The interrupt handler will receive control and display an
"Overflow" message whenever an INTO calls it. Since the INTO instruction
is only one byte and is also very fast, using it this way results in very
little size or performance degradation.
X% = 30000
Y% = X% * 3
PRINT Y%
Listing 1-4: This brief program illustrates how overflow errors are handled
in BASIC.
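The figure of 24464 in place of 90000 is not arbitrary. It is what remains when only the low 16 bits of the true product are kept. The Python sketch below reproduces that arithmetic. The function names are my own, and the check_overflow flag merely imitates the effect of compiling with /d, not the INTO mechanism itself.

```python
def to_int16(value):
    """Keep the low 16 bits and reinterpret them as a signed integer."""
    value &= 0xFFFF
    return value - 0x10000 if value >= 0x8000 else value

def multiply(a, b, check_overflow=False):
    """Integer multiply; check_overflow=True models compiling with /d."""
    exact = a * b
    if check_overflow and not -32768 <= exact <= 32767:
        raise OverflowError("Overflow")
    return to_int16(exact)

print(multiply(30000, 3))                  # 24464, that is 90000 - 65536
try:
    multiply(30000, 3, check_overflow=True)
except OverflowError as err:
    print(err)
```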
COMPILER OPTIMIZATION
Designing a compiler for a language as complex as BASIC involves some very
tricky programming indeed. Although it is one thing to translate a BASIC
source file into a series of assembly language commands, it is another
matter entirely to do it well! Consider that the compiler must be able to
accept a BASIC statement such as X! = ABS(SQR((Y# + Z!) ^ VAL(Work$))), and
reduce that to the individual steps necessary to arrive at the correct
result.
Many, many details must be accounted for and handled, not the least of
which are syntax or other errors in the source code. Moreover, there are
an infinite number of ways that a programmer can accomplish the same thing.
Therefore, the compiler must be able to recognize many different
programming patterns, and substitute efficient blocks of assembler code
whenever it can. This is the role of an *optimizing compiler*.
One important type of optimization is called *constant folding*. This
means that as much math as possible is performed during compilation, rather
than creating code to do that when the program runs. For example, if you
have a statement such as X = 4 * Y * 3, BC can, and does, change that to X
= Y * 12. After all, why multiply 3 times 4 later, when the answer can be
determined now? This substitution is performed entirely by the BC
compiler, without your knowing about it.
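Constant folding for a simple product can be sketched in Python as follows. This shows the general technique only and makes no claim about how BC implements it. The list-of-factors representation and the name fold_product are assumptions made for the example.

```python
def fold_product(factors):
    """Combine all integer constants in a product into a single constant.

    factors is a list mixing ints and variable names, e.g. [4, "Y", 3].
    """
    constant = 1
    variables = []
    for factor in factors:
        if isinstance(factor, int):
            constant *= factor          # do the math now, at "compile time"
        else:
            variables.append(factor)    # must wait until the program runs
    if not variables:
        return [constant]               # the whole expression was constant
    if constant == 1:
        return variables                # multiplying by 1 is pointless
    return variables + [constant]

print(fold_product([4, "Y", 3]))
```

For X = 4 * Y * 3 this yields the factors Y and 12, so only one multiplication is left for run time.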
Another important type of optimization is BASIC's ability to remember
calculations it has already performed, and use the results again later if
possible. BC is especially brilliant in this regard, and it can look ahead
many lines in your source code for a repeated use of the same calculations.
Listing 1-5 shows a short fragment of BASIC source code, along with the
resultant assembler output.
X% = 3 * Y% * 4
  MOV  AX,12                ;move the value 12 into AX
  IMUL WORD PTR [Y%]        ;Integer-Multiply that times Y%
  MOV  WORD PTR [X%],AX     ;assign the result in AX to X%

A% = S% * 100
  MOV  BX,AX                ;save the result from above in BX
  MOV  AX,100               ;then assign AX to 100
  IMUL WORD PTR [S%]        ;now multiply AX times S%
  MOV  WORD PTR [A%],AX     ;and assign A% from the result

Z% = Y% * 12
  MOV  WORD PTR [Z%],BX     ;assign Z% from the earlier result
Listing 1-5: These short code fragments illustrate how adept BC is at
reusing the result of earlier calculations already performed.
As you can see in the first part of Listing 1-5, the value of 3 times 4 was
resolved to 12 by the compiler. Code was then generated to multiply the
12 times Y%, and the result is in turn assigned to X%. This is similar to
the compiled code examined earlier in Listing 1-1. Notice, however, that
before the second multiplication of S% is performed, the result currently
in AX is saved in the BX register. Although AX is destroyed by the
subsequent multiplication of S% times 100, the result that was saved
earlier in BX can be used to assign Z% later on. Also notice that even
though 3 * 4 was used first, BC was smart enough to realize that this is
the same as the 12 used later.
While the compiler can actually look ahead in your source code as it
works, such optimization will be thwarted by the presence of line numbers
and labels, as well as IF blocks. Since a GOTO or GOSUB could jump to a
labeled source line from anywhere in the program, there is no way for BC
to be sure that earlier statements were executed in sequence. Likewise,
the compiler has no way to know which path in an IF/ELSE block will be
taken at run time, and thus cannot optimize across those statements.
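The reuse of earlier results, and the way a label defeats it, can be modeled in Python as shown here. This is a simplified sketch of the general technique and not BC's algorithm. The tuple format for the program and the names canonical and find_reuse are hypothetical. Results are tracked by the variable that holds them rather than by register, and only labels are treated as barriers, not IF blocks.

```python
def canonical(factors):
    """Fold constants and sort names so 3 * Y% * 4 and Y% * 12 compare equal."""
    constant = 1
    names = []
    for factor in factors:
        if isinstance(factor, int):
            constant *= factor
        else:
            names.append(factor)
    return (tuple(sorted(names)), constant)

def find_reuse(program):
    """Return {target: earlier_target} for products that were already computed.

    program items are ("label", name) or ("assign", target, factors).
    """
    available = {}    # canonical product -> variable currently holding it
    reuse = {}
    for item in program:
        if item[0] == "label":
            available.clear()       # a jump could land here: forget everything
            continue
        _, target, factors = item
        key = canonical(factors)
        names = key[0]
        if key in available:
            reuse[target] = available[key]
        # Assigning to target makes any product that reads it, or that was
        # held in it, out of date.
        available = {k: v for k, v in available.items()
                     if target not in k[0] and v != target}
        if target not in names:
            available.setdefault(key, target)
    return reuse

listing = [
    ("assign", "X%", [3, "Y%", 4]),
    ("assign", "A%", ["S%", 100]),
    ("assign", "Z%", ["Y%", 12]),
]
print(find_reuse(listing))
with_label = listing[:2] + [("label", "Again")] + listing[2:]
print(find_reuse(with_label))
```

With the straight-line code from Listing 1-5, the product assigned to Z% is found to be the one already computed for X%. Once a label sits between them, nothing can be assumed and the multiplication has to be repeated.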
THE BASIC RUN-TIME LIBRARIES
Microsoft compiled BASIC lets you create two fundamentally different types
of programs. Those that are entirely self-contained in one .EXE file are
compiled with the /o command line switch. In this case, the compiler
creates translations such as those we have already discussed, and also
generates calls to the BASIC language routines contained in the library
files supplied by Microsoft. When your compiled program is subsequently
linked, only those routines that are actually used will be added to your
program.
When /o is not used, a completely different method is employed. In this
case, a special .EXE file that contains support for every BASIC statement
is loaded along with the BASIC program when the program is run from the DOS
command line. As you are about to see, there are advantages and
disadvantages to each method. For the purpose of this discussion I will
refer to stand-alone programs as BCOM programs, after the BCOMxx.LIB
library name used in all versions of QuickBASIC. Programs that instead
require the BRUNxx.EXE file to be present at run time will be called
BRUN programs.
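As a minimal sketch, the same source file can be built either way. The
file name HELLO.BAS is hypothetical, and the command lines shown in the
comments assume that BC and LINK are on the DOS path and that default
output names are acceptable.

```basic
' HELLO.BAS (hypothetical file name)
'
' To build a stand-alone BCOM program, compile with /O:
'     BC HELLO /O;
'     LINK HELLO;
'
' To build a BRUN program, compile without /O:
'     BC HELLO;
'     LINK HELLO;
'
' The BASIC source is identical in both cases.
PRINT "Hello from compiled BASIC"
END
```

Only the compile step differs. The LINK command is the same in both cases
because BC records the name of the appropriate library in the object file.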
Beginning with BASIC 7 PDS, the library naming conventions used by
Microsoft have become more obscure. This is because PDS includes a number
of variations for each method, depending on the type of "math package" that
is specified when compiling and whether you are compiling a program to run
under DOS or OS/2. These variations will be discussed fully in Chapter 6,
when we examine all of the possible options that each compiler version has
to offer. But for now, we will consider only the two basic methods--BCOM
and BRUN. The primary differences between these two types of programs are
shown in Figure 1-2.
1. BCOM programs require less memory, run faster, and do not require
the presence of the BRUNxx.EXE file when the program is run.
2. BRUN programs occupy less disk space, and also allow subsequent
chaining to other programs that can share the common library code which
is already resident. Chained-to programs also load quickly because the
BRUN library is already in memory.
Figure 1-2: A comparison of the fundamental differences between BCOM and
BRUN programs.
Stand-alone BCOM programs are always larger than equivalent BRUN programs
because the library code for PRINT, INSTR, and so forth is included in the
final .EXE file. However, less *memory* will be required when the program
runs, since only the code that is really needed is loaded into the PC.
Likewise, a BRUN program will take less disk space, because it contains
only the compiled code. The actual routines that handle each BASIC
statement are stored in the BRUNxx.EXE run-time file, and that file is
loaded automatically when the main program is run from DOS.
You might think that since a BRUN program is physically smaller on disk
it will load faster, but this is not necessarily true. When you execute
a BRUN program from the DOS command line, one of the first things it does
is load the BRUN .EXE support file. Since this support file is fairly
large, the overall load time will be much greater than the compiled BASIC
program's file size would indicate. However, if the main program
subsequently chains to another BASIC program, that program will load
quickly because the BRUN file does not need to be loaded a second time.
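Chaining can be sketched with two small programs. The file names MAIN.BAS
and SECOND.BAS and the variable Counter are hypothetical; the listing shows
two separate source files, each compiled and linked on its own without /O.

```basic
' ---- MAIN.BAS (hypothetical name), compiled without /O ----
DEFINT A-Z
COMMON Counter            ' passed along to the chained-to program
Counter = 1
PRINT "MAIN: Counter ="; Counter
CHAIN "SECOND"            ' loads and runs SECOND.EXE

' ---- SECOND.BAS (hypothetical name), also compiled without /O ----
DEFINT A-Z
COMMON Counter            ' must match the COMMON list in MAIN.BAS
Counter = Counter + 1
PRINT "SECOND: Counter ="; Counter
END
```

When MAIN is started from DOS, the BRUN support file is loaded once. The
CHAIN statement then needs to load only the small SECOND.EXE file.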
One other important difference between these two methods is the way that
the BASIC language routines are accessed. When a BCOM program is compiled
and linked, the necessary routines are called in the usual fashion. That
is, the compiler generates code that calls the routines in the BCOM library
directly. When the program is subsequently linked, the procedure names are
translated by LINK into the equivalent memory addresses. That is, a call
to PRINT is in effect translated from CALL B$PESD to CALL ####:####, where
####:#### is a segment and address.
BRUN programs, on the other hand, use a system of interrupts to
access the BASIC language routines. Since there is no way for LINK to know
exactly where in memory the BRUNxx.EXE file will be ultimately loaded, the
interrupt vector table located in low memory is used to hold the various
routine addresses. Although many of these interrupt entries are used by
the PC's system resources, many others are available. Again, I will defer
a thorough treatment of call methods and interrupts until Chapter 14. But
for now, suffice it to say that a direct call is slightly faster than an
indirect call, where the address to be called must first be retrieved from
a table.
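To make the idea of a table of addresses more concrete, here is a minimal
sketch that reads one entry from the interrupt vector table. Interrupt 21h
(the DOS services interrupt) is chosen only as a convenient example; it is
not meant to suggest which interrupts the BRUN file uses. Each entry is
four bytes long and holds an offset followed by a segment.

```basic
DEFINT A-Z
IntNum = &H21                   ' DOS services interrupt, an example only
Entry = IntNum * 4              ' each table entry occupies four bytes

DEF SEG = 0                     ' the table begins at address 0000:0000
Offset& = PEEK(Entry) + 256& * PEEK(Entry + 1)
Segment& = PEEK(Entry + 2) + 256& * PEEK(Entry + 3)
DEF SEG                         ' restore BASIC's default segment

PRINT "Interrupt "; HEX$(IntNum); "h points to ";
PRINT HEX$(Segment&); ":"; HEX$(Offset&)
```

An interrupt instruction makes the CPU perform this same lookup before it
can transfer control, which is the extra step a direct call avoids.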
As an interesting aside, the routines in the BRUNxx.EXE file in fact
modify the caller's code to perform a direct call, rather than an interrupt
instruction. Therefore, the first time a given block of code is executed,
it calls the run-time routines through an interrupt instruction.
Thereafter, the address where the BRUN file has been loaded is known, and
will be used the next time that same block of code is executed. In
practice, however, this improves only code that lies within a FOR/NEXT,
WHILE, or DO loop. Further, code that is executed only once will actually
be much slower than in a BCOM program, because of the extra instructions
needed for self-modification (the program changing its own code).
Notice that when BC compiles your program, it places the name of the
appropriate library into the object file. The name BC uses depends on
which compiler options were given. This way you don't have to specify the
correct name manually, and LINK can read that name and act accordingly.
Although QuickBASIC provides only two libraries--one for BCOM programs and
one for BRUN--BASIC PDS offers a number of additional options. Each of
these options requires the program to be linked with a different library.
That is, there are both BRUN and BCOM libraries for use with OS/2, for near
and far strings, and for using IEEE or Microsoft's alternate math
libraries. Yet another library is provided for 8087-only operation.
GRANULARITY
Until now, we have examined only the actions and methods used by the BC
compiler. However, the process of creating an .EXE file that can be run
from the DOS command line is not complete until the compiled object file
has been linked to the BASIC libraries. I stated earlier that when a
stand-alone program is created using the /o switch, only those routines in
the BCOM library that are actually needed will be added to the program.
Unfortunately, that is not entirely accurate. While it is true that LINK
is very smart and will bring in only those routines that are actually
called, there is one catch.
Imagine that you have written a BASIC program which is composed of two
separate modules. In one file is the main program that contains only
in-line code, and in the other are two BASIC subprograms. Even if the main
program calls only one of those subprograms, both will be added when the
program is linked. That is, LINK can resolve routines to the source file
level only, but cannot extract a single routine from an object module which
contains multiple routines. Since a .LIB library file is merely a
collection of separate object modules, all of the routines that reside in
a given module will be added to a program, even if only one has been
accessed. This property is called *granularity*, and it determines how
finely LINK can remove routines from a library.
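The two-module situation just described might look like the sketch below.
The file names and the subprogram names ShowTitle and ShowHelp are
hypothetical, and the listing shows two separate source files that are
compiled individually and then linked together.

```basic
' ---- MAIN.BAS (hypothetical name): in-line code only ----
DECLARE SUB ShowTitle ()
ShowTitle                 ' the only subprogram this program calls
END

' ---- SUBS.BAS (hypothetical name): two subprograms in one module ----
SUB ShowTitle
  PRINT "Granularity demonstration"
END SUB

SUB ShowHelp              ' never called, but linked in anyway because it
  PRINT "Help text"       ' shares an object module with ShowTitle
END SUB
```

Following the same reasoning, placing ShowHelp in a source file of its own
would let LINK leave it out of any program that never calls it.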
In the case of the libraries supplied with BASIC, the determining factor
is which assembly language routines were combined with which other routines
in the same source file by the programmers at Microsoft. In QuickBASIC
4.5, for example, when a program uses the CLS statement, the routines that
handle COLOR, CSRLIN, POS(0), LOCATE, and the function form of SCREEN are
also added. This is true even if none of those other statements have been
used. Fortunately, Microsoft has done much to improve this situation in
BASIC PDS, but there is still room for improvement. In BASIC PDS, CLS is
stored in a separate file; however, POS(0), CSRLIN, and SCREEN are still
together, as are COLOR and LOCATE.
Obviously, Microsoft has its reasons for doing what it does, and I
won't attempt to second-guess its expertise here. The BASIC language
libraries are extremely complex and contain many routines. (The QuickBASIC
4.5 BCOM45.LIB file contains 1,485 separate assembler procedures.) With
such an enormous number of assembly language source files to deal with, it
no doubt makes a lot of sense to organize the related routines together.
But it is worth mentioning that Crescent Software's P.D.Q. library can
replace much of the functionality of the BCOM libraries, and with complete
granularity. In fact, P.D.Q. can create working .EXE programs from BASIC
source that are less than 800 bytes in size.
SUMMARY
=======
In this chapter, you learned about the process of compiling, and the kinds
of decisions a sophisticated compiler such as Microsoft BASIC must make.
In some cases, the BASIC compiler performs a direct translation of your
BASIC source code into assembly language, and in others it creates calls
to existing routines in the BCOM libraries. Besides creating the actual
assembler code, BASIC must also allocate space for all of the data used in
a program.
You also learned some basics about assembly language, which will be
covered in more detail in Chapter 13. However, upcoming chapters will
also use brief assembly language listings to show the relative efficiency
of different coding styles. In Chapter 2, you will
learn how variables and other data are stored in memory.